Mixture of Inference Networks for VAE-Based Audio-Visual Speech Enhancement
Authors
Abstract
We address unsupervised audio-visual speech enhancement based on variational autoencoders (VAEs), where the prior distribution of the clean speech spectrogram is simulated using an encoder-decoder architecture. At enhancement (test) time, the trained generative model (decoder) is combined with a noise model whose parameters need to be estimated. The initialization of the latent variables describing the generative process via the decoder is crucial, as the overall inference problem is non-convex. This is usually done using the output of the encoder given the noisy audio and visual data as input. Current audio-visual VAE models do not provide an effective initialization because the two modalities are tightly coupled (concatenated) in the associated architectures. To overcome this issue, we introduce the mixture of inference networks variational autoencoder (MIN-VAE). Two encoder networks take as input, respectively, the audio and the visual data, and the posterior of the latent variables is modeled as a mixture of the Gaussian distributions output by each encoder. The mixture variable is also latent, and therefore learning the optimal balance between the two encoders is part of the inference as well. By training a shared decoder network, the model learns to adaptively fuse the two modalities. Moreover, at test time the visual encoder, which takes the (clean) visual data as input, is used for initialization. A variational inference approach is derived to train the proposed model. Thanks to the novel inference procedure and the robust initialization, MIN-VAE exhibits superior performance compared with the standard audio-only as well as audio-visual counterparts.
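To make the described architecture concrete, below is a minimal PyTorch sketch of a mixture-of-inference-networks VAE: two Gaussian encoders (one per modality), a shared decoder for the clean-speech spectrogram, and a mixture weight over the two encoders. The layer sizes, toy input dimensions, class and variable names, and the simplified mixture-weighted ELBO are all illustrative assumptions, not the paper's exact model or objective (in the paper, the mixture variable is itself latent and the variational objective is derived accordingly).

```python
# Minimal MIN-VAE-style sketch (illustrative; not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEncoder(nn.Module):
    """Maps one modality to the mean and log-variance of a Gaussian posterior."""
    def __init__(self, in_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class MINVAE(nn.Module):
    def __init__(self, audio_dim=513, visual_dim=67 * 67, latent_dim=32, hidden=128):
        super().__init__()
        self.enc_audio = GaussianEncoder(audio_dim, latent_dim, hidden)    # audio inference network
        self.enc_visual = GaussianEncoder(visual_dim, latent_dim, hidden)  # visual inference network
        # Shared decoder: maps z to a positive clean-speech spectrogram variance.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, audio_dim), nn.Softplus(),
        )
        # Mixture weights over the two inference networks; in the paper the
        # mixture variable is latent and inferred, here it is a free parameter
        # learned jointly with the networks for simplicity.
        self.mix_logits = nn.Parameter(torch.zeros(2))

    @staticmethod
    def reparameterize(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, audio, visual):
        stats = [self.enc_audio(audio), self.enc_visual(visual)]
        weights = F.softmax(self.mix_logits, dim=0)
        zs = [self.reparameterize(mu, logvar) for mu, logvar in stats]   # one sample per component
        recons = [self.dec(z) for z in zs]
        return stats, weights, recons

def elbo(model, audio, visual):
    """Simplified objective: mixture-weighted sum of per-component Gaussian ELBOs."""
    stats, weights, recons = model(audio, visual)
    loss = 0.0
    for w, (mu, logvar), var_hat in zip(weights, stats, recons):
        # Itakura-Saito-like reconstruction term for the power spectrogram `audio`.
        rec = (audio / (var_hat + 1e-8) + torch.log(var_hat + 1e-8)).sum(dim=-1)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
        loss = loss + w * (rec + kl)
    return loss.mean()

if __name__ == "__main__":
    model = MINVAE()
    audio = torch.rand(8, 513)       # toy power-spectrogram frames
    visual = torch.rand(8, 67 * 67)  # toy flattened lip-region frames
    print(float(elbo(model, audio, visual)))
```

Following the abstract, test-time enhancement would initialize the latent variables from enc_visual alone, since the visual stream stays clean while the audio is noisy; the shared decoder is then combined with a noise model whose parameters are estimated during inference.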
Similar Resources
Audio Visual Speech Enhancement
This thesis presents a novel approach to speech enhancement by exploiting the bimodality of speech production and the correlation that exists between audio and visual speech information. An analysis into the correlation of a range of audio and visual features reveals significant correlation to exist between visual speech features and audio filterbank features. The amount of correlation was also...
Inventory-Based Audio-Visual Speech Enhancement
In this paper we propose to combine audio-visual speech recognition with inventory-based speech synthesis for speech enhancement. Unlike traditional filtering-based speech enhancement, inventory-based speech synthesis avoids the usual trade-off between noise reduction and consequential speech distortion. For this purpose, the processed speech signal is composed from a given speech inventory whi...
Audio-visual enhancement of speech in noise.
A key problem for telecommunication or human-machine communication systems concerns speech enhancement in noise. In this domain, a certain number of techniques exist, all of them based on an acoustic-only approach--that is, the processing of the audio corrupted signal using audio information (from the corrupted signal only or additive audio information). In this paper, an audio-visual approach ...
Using twin-HMM-based audio-visual speech enhancement as a front-end for robust audio-visual speech recognition
In this paper we propose the use of the recently introduced twin-HMM-based audio-visual speech enhancement algorithm as a front-end for audio-visual speech recognition systems. This algorithm determines the clean speech statistics in the recognition domain based on the audio-visual observations and transforms these statistics to the synthesis domain through the so-called twin HMMs. The adopted fr...
Introducing the Turbo-Twin-HMM for Audio-Visual Speech Enhancement
Models for automatic speech recognition (ASR) hold detailed information about spectral and spectro-temporal characteristics of clean speech signals. Using these models for speech enhancement is desirable and has been the target of past research efforts. In such model-based speech enhancement systems, a powerful ASR is imperative. To increase the recognition rates especially in low-SNR condition...
Journal
Journal title: IEEE Transactions on Signal Processing
Year: 2021
ISSN: 1053-587X, 1941-0476
DOI: https://doi.org/10.1109/tsp.2021.3066038